Putting multicore processing in context: Part One
By Todd Brian, Embedded.com
Today’s embedded devices are increasingly based on multi-core designs.
Systems-on-a-chip (SoCs) often contain two or more processor cores in
homogeneous or heterogeneous combinations, and FPGA-based designs can include a
virtually unlimited number and variety of cores. An asymmetric multiprocessing
(AMP)-based RTOS is one approach to utilizing multi-core processors; symmetric
multiprocessing (SMP) is another. But from a historical perspective, multiprocessor and multicore designs have been around almost as long as computers themselves. I believe that Burroughs introduced the first commercially available one back in 1962; many more have followed. Most of them had very specific purposes in mind. Many of the early ones were used in data centers and for scientific applications. You can also make a case for multiprocessing being present in the early desktops of the late 80s, with the introduction of the math co-processor and device controllers such as Ethernet controllers.

The question is: why are embedded device manufacturers becoming so enamored with multi-core devices now? One reason is that embedded devices have more and
more tasks being heaped on them.

My first cell phone was a StarTAC. It had a rudimentary LCD display and
frankly was not very useful for anything other than making a phone call. It had
a 24-hour standby time and maybe an hour talk time once the battery was broken
in. That was not a real problem though since my first plan was for 30 minutes a
month.

But look at today’s phones and the demands made on them: built-in cameras,
video capabilities, nice graphical interfaces for browsing the Web, calendars
and all sorts of games. It is no wonder that designers are looking at ways to
get more performance and a better form factor while extending battery life.

Why multicores? Adam Smith's answer
One answer to my question may lie in economic theory, specifically Adam Smith’s “Wealth of Nations.” It opens with a dramatic recommendation that rather than each worker performing all of the tasks required for pin making (rolling the wire, cutting the wire into pieces, sharpening the wire and bobbing the end), each worker should specialize in performing a single task. One worker would roll the wire; another would sharpen the wire; another would bob the ends; and so on.

Smith suggested that a pin factory that had adopted such a "specialization of
labor" might produce tens of thousands of pins a day whereas a pin factory in
which each worker attempted to produce pins, from start to finish, would produce
very few pins.

Heterogeneous multicore designs such as the OMAP processor, which employ both
a DSP and an ARM core, embody Adam Smith’s concepts of both division of labor
and a specialization of labor. This approach is a good one for
manufacturers of cell phones and other electronic devices, which perform a
multitude of roles: the DSP efficiently performs signal-processing activities
and the ARM core performs the role of a general-purpose processor.

Since power consumption is a function of both voltage and frequency (power consumption increases proportionally with frequency and with the square of voltage), the division of labor aspect of
multiprocessing is of particular interest to manufacturers of portable devices.
Most portable electronic devices do not operate at peak capacity of their
processors all the time. However, to handle peak demand, manufacturers have to
provide a processor that can deliver the processing power to meet peak demands.
While the processor can be throttled down between periods of peak performance,
this may not be an option for devices that require operation in an intermediate
performance range. Most processors must either run at full speed or be in an idle state.
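To make the voltage and frequency arithmetic concrete, the short C sketch below compares one hypothetical core at full voltage and clock against two cores running at half the clock and a reduced voltage. The operating points are invented purely for illustration and are not taken from any real device.

    #include <stdio.h>

    /* Dynamic power is roughly proportional to C * V^2 * f. Only the
     * ratios matter here, so capacitance C is treated as 1.0. */
    static double rel_power(double voltage, double frequency)
    {
        return voltage * voltage * frequency;
    }

    int main(void)
    {
        /* Hypothetical operating points, normalized to 1.0. */
        double one_fast_core  = rel_power(1.0, 1.0);
        double two_slow_cores = 2.0 * rel_power(0.8, 0.5);

        printf("one fast core  : %.2f (relative power)\n", one_fast_core);
        printf("two slow cores : %.2f (relative power)\n", two_slow_cores);
        /* With these assumed numbers, the two slower cores deliver the same
         * aggregate clock rate at roughly 64% of the power. */
        return 0;
    }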
This is where adding a number of slower processors may be appealing to device manufacturers. If the issue of peak performance can be segmented so that
multiple processors can address peak performance needs, they can then be
throttled down or put to sleep during periods of inactivity. For periods where
performance demands are low, one or two of the low performance processors can
provide satisfactory processing power.

The limits on your multicore design are essentially the same as the
constraints on classical multiprocessing systems: Amdahl’s law; serial code
execution; code execution granularity; load-imbalance; synchronization; and
cache coherency.
Illustrated in Figure 1 above,
Amdahl’s Law states, “The non-parallel fraction of the code (i.e., overhead)
imposes the upper limit on the scalability of the code.” Thus, the maximum parallel speedup S(p) for a program that has a parallel fraction f is:

S(p) < 1 / (1 - f)

The non-parallel (serial) fraction of the program includes the communication
and synchronization overhead.

So according to Amdahl, if you have a program or a suite of tasks that is comprised of, say, 50 percent serial code, then no matter how many processors you add you will never see more than a 2x speedup. On the other hand, if your code contains no serial code at all (a set of tasks with no communication requirements, no shared resources, and so on), then you can achieve a 10x speedup by going from one processor to 10.
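To make those numbers concrete, here is a small sketch in plain C (nothing beyond the standard library is assumed) that evaluates the general form of Amdahl’s law, S(p) = 1 / ((1 - f) + f/p), which approaches the 1/(1 - f) bound as the processor count p grows:

    #include <stdio.h>

    /* Amdahl's law: with parallel fraction f and p processors,
     * S(p) = 1 / ((1 - f) + f / p), always less than 1 / (1 - f). */
    static double amdahl(double f, int p)
    {
        return 1.0 / ((1.0 - f) + f / (double)p);
    }

    int main(void)
    {
        /* 50 percent serial code: the speedup is capped at 2x. */
        printf("f = 0.5, p = 10   -> %.2fx\n", amdahl(0.5, 10));
        printf("f = 0.5, p = 1000 -> %.2fx\n", amdahl(0.5, 1000));

        /* No serial code at all: speedup tracks the processor count. */
        printf("f = 1.0, p = 10   -> %.2fx\n", amdahl(1.0, 10));
        return 0;
    }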
Serial code execution
What is serial execution? Execution is considered serial when one or more cores in a multi-core system are not executing code. This can occur for a variety of reasons. Sometimes it is because there is no reason for code to be executing on multiple cores; basically, the system is idle. From a performance perspective, this is nothing to worry about. However, there are a number of situations where it would be desirable for all of the processors to be executing code, but they can’t because of data or resource conflicts, load imbalance, synchronization or the need to maintain cache coherency.

Spinlocks
One prime example of how code becomes serialized is when there is the potential for processes running on different processors to attempt to gain access to a resource that is already in use by another process (Figure 2, below).
OS developers use what is called a Spinlock to act as a gatekeeper for a
resource. A Spinlock (sometimes called a Spinlock Mutex) is a special data
structure used in multiprocessing. It is similar to a Mutex or a semaphore in
that it protects access to resources that cannot be accessed by more than one
process at a time. The primary difference between a Spinlock and a Mutex is that
the Spinlock depends on a hardware ”test and set” mechanism that arbitrates
between two or more processes attempting access at the same time.

When a process attempts to access a resource that is in use by another process, the “blocked” process(es) will “spin” until the resource becomes available; hence the name Spinlock. In essence, the processes queue up until they can get access to the resource. There are different algorithms that can be used to assign priority to processes waiting on a Spinlock. Regardless of the algorithm used, once a process gets access to a resource, it keeps it until it decides to give it up.
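The following is a minimal sketch of the idea, using the C11 atomic_flag type as a stand-in for the hardware test-and-set instruction. A real RTOS spinlock would add details such as priority handling and interrupt or preemption control, all omitted here.

    #include <stdatomic.h>
    #include <stdio.h>

    /* A minimal spinlock built on an atomic test-and-set primitive. */
    typedef struct {
        atomic_flag locked;
    } spinlock_t;

    static void spin_lock(spinlock_t *lock)
    {
        /* atomic_flag_test_and_set() atomically sets the flag and returns
         * its previous value, so we spin for as long as someone holds it. */
        while (atomic_flag_test_and_set(&lock->locked)) {
            /* busy-wait; a real implementation might pause or back off */
        }
    }

    static void spin_unlock(spinlock_t *lock)
    {
        atomic_flag_clear(&lock->locked);
    }

    static spinlock_t io_lock = { ATOMIC_FLAG_INIT };
    static int shared_counter;

    int main(void)
    {
        spin_lock(&io_lock);
        shared_counter++;                /* critical section */
        spin_unlock(&io_lock);
        printf("counter = %d\n", shared_counter);
        return 0;
    }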
The granularity of Spinlocks and the resources that they are used to protect has a dramatic impact on the degree to which application code is serialized.
Devices such as hard drives and other IO devices come to mind when I think of
shared resources but, depending on the type of OS, kernel objects can also be
protected by Spinlocks. The granularity of a Spinlock is a function of the
number and type of resources that they are used to protect. A coarse-grained implementation of a Spinlock would be used to protect all IO
devices. So, for example, once a process gains access to any IO device, even a
process that needs access to an unused resource will spin in the Spinlock until
the lock is released. The other extreme would be to have a Spinlock protecting
each shared resource in the system.

Each approach has its advantages. A coarse-grained approach is simple to
implement, but has the potential to serialize the application. The fine-grained
approach makes more resources available in parallel, but is more complicated and
has the potential to exhibit problems such as deadlock and other classical
problems associated with resource protection schemes used in single processor
systems. This problem can be overcome to a large degree through the use of
watchdog applications and other types of heuristics. However, it does add to the
overall overhead of the system, so it comes at a cost.

One final note: Spinlocks don’t just protect system data or resources; they can also be applied to application data. For example, if some device configuration data consists of a string of bytes, it will need to be locked during updates to ensure data integrity.
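The contrast between the two granularities, and the locking of application data just mentioned, might look something like the sketch below. The device names and the use of POSIX mutexes are illustrative assumptions only; an RTOS would supply its own locking primitives.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    /* Coarse-grained: one lock guards every IO device, so a task using the
     * UART also blocks a task that only needs the (idle) Ethernet port. */
    static pthread_mutex_t all_io_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Fine-grained: one lock per resource, including application data such
     * as a configuration string that must never be read half-updated. */
    static pthread_mutex_t uart_lock   = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER;

    static char device_config[64];

    static void update_config(const char *new_cfg)
    {
        pthread_mutex_lock(&config_lock);     /* lock only what we touch */
        strncpy(device_config, new_cfg, sizeof(device_config) - 1);
        pthread_mutex_unlock(&config_lock);
    }

    int main(void)
    {
        update_config("baud=115200,parity=none");
        printf("config: %s\n", device_config);
        (void)all_io_lock;                    /* shown for contrast only */
        (void)uart_lock;
        return 0;
    }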
Granularity of Parallel Code Execution
There are two general types of coding approaches that you can take with multicore devices: fine-grained parallelism (in the case of the Connection Machine and others that have hundreds or thousands of processors, this is called massively parallel) and coarse-grained parallelism.

Fine-grained architectures (Table 1, above) have between two and a couple of dozen processors. An example of a
above) have between two and a couple of dozen processors. An example of a
massively parallel problem approach would be if I want to add 2048 numbers
together and had 1024 processors to do it with. In the first iteration, all 1024
processors each add two numbers together. The second iteration of 512 processors
adds the 1024 numbers together and so on. It would only take 11 iterations to
add all 2048 numbers together in this fashion. Course-grained programming (Table 2, above
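A sketch of that pairwise (tree) reduction in plain C is shown below. The inner loop runs serially here, but every addition within one pass touches a disjoint pair of elements, so a machine with enough processors could perform each pass as a single parallel step; the pass count comes out to log2(2048) = 11.

    #include <stdio.h>

    #define N 2048

    int main(void)
    {
        static long value[N];
        for (int i = 0; i < N; i++)
            value[i] = i + 1;                 /* 1 + 2 + ... + 2048 */

        int passes = 0;
        for (int stride = 1; stride < N; stride *= 2) {
            /* N / (2 * stride) independent additions per pass; each one
             * could be handed to its own processor. */
            for (int k = 0; k < N; k += 2 * stride)
                value[k] += value[k + stride];
            passes++;
        }

        printf("sum = %ld after %d passes\n", value[0], passes);
        return 0;
    }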
Coarse-grained programming (Table 2, above, and Figure 3, below), on the other hand, approaches the problem at a level we are all more familiar with: segmenting our programs at the task level. Legacy code can be easily converted to run on multi-core architectures using the coarse-grained approach.

It is usually a good fit for engineering departments as teams are usually
tasked at the application level. A good understanding of how the overall system works is necessary so that the different tasks can be prioritized and their execution ordered in a way that optimizes performance. The downside of coarse-grained parallelism is that it may lead to load imbalance.

There are advantages and disadvantages to both approaches. The fine-grained
approach brings massive amounts of processing power to bear on a problem. It
also has the potential to utilize the processors more equally in a load
balancing system. That being said, other than data plane applications, there are very few control plane applications in the embedded space that could easily be implemented to benefit from massive parallelism. Furthermore, fine-grained segmentation is almost impossible to do without a compiler that is explicitly designed to exploit massive parallelism.

On the other hand, if one has a compiler that is designed to exploit
fine-grained parallelism, or you have hand-written the code to exploit a large number of processors, then it is easy for all the processors to be utilized efficiently. Another disadvantage of exploiting fine-grained parallelism is that there is a large price to pay in terms of synchronization and communication overhead.

Synchronization Overhead
In the context of parallel execution, synchronization overhead is the waiting and communication incurred by interdependent tasks that must wait for other tasks to deliver intermediate results before they can begin executing again. Unlike waiting for a Spinlock to release because of resource contention, synchronization overhead is caused when a task cannot continue execution until an intermediate result is provided by another task.

In the sum of 2048 numbers example used above, delays incurred due to memory
latency, interrupts or other factors cause the execution times to vary between
the different processing elements. As a result of the different execution times,
tasks that are dependent on these intermediate results essentially stall until
the result is ready. As the parallelism of an application increases, the
potential for synchronization overhead also increases. In general, for almost all problem domains there is a point where adding more processing elements will actually slow performance down.

Load Imbalance
Fine-grained applications typically take advantage of doing the same activity many, many times in parallel; the example above used a loop unrolling approach. Coarse-grained segmentation approaches parallelism at the task level, and it is almost impossible to design and segment code in such a way that each task has the same execution time. The different execution times lead to what is called “load imbalance.”

As shown in Figure 4 below, if there are a discrete number of tasks that need
to be performed and the execution times are unequal, some cores will be idle
while others are still executing.
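A few lines of arithmetic show the effect. The task times below are invented purely for illustration: the elapsed time is set by the longest-running core, while the shorter tasks leave their cores idle.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical execution times (ms) of four tasks, one per core. */
        int task_ms[4] = { 30, 10, 5, 5 };

        int total = 0, longest = 0;
        for (int i = 0; i < 4; i++) {
            total += task_ms[i];
            if (task_ms[i] > longest)
                longest = task_ms[i];
        }

        /* Elapsed time is dictated by the slowest core. */
        printf("elapsed = %d ms, average core utilization = %.0f%%\n",
               longest, 100.0 * total / (4.0 * longest));
        return 0;
    }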
Cache Coherency in uniprocessor designs
Cache memory is used to speed execution of a microprocessor. It is typically implemented as a relatively small amount of high-speed memory. Unlike main memory, cache memory can satisfy memory reads and writes at processor frequency. Main memory, on the other hand, typically takes two or more cycles to access.
When a main memory fetch occurs, the processor stalls. This is avoided by keeping
anticipated or frequently used data and instructions in cache. The processor
will only stall when the information it needs is not in cache. One side effect of using this two-layer approach to memory access is that if
cache is written to, then cache and main memory no longer contain the same
information. When this occurs, cache and main memory are considered incoherent.
There are different approaches that can address this problem. The two most
common are “write through” and “write back.” The simplest cache coherency mechanism is “write through.” When writing to
cache using “write through” each time the cache is written to, it also writes
through to main memory. Thus coherency is maintained. The down side is that the
processor stalls while the write through completes. A more advanced implementation for maintaining cache coherency tries to
address stalling during a write by buffering up writes so that they occur less
frequently. When the buffered writes are written to memory, they are written in
“burst mode.” This technique is called “write back.”

The write back implementation, for obvious reasons, is more complicated than
write through. The write back implementation requires that a list of “dirty”
cache lines be maintained. The dirty entries indicate cache lines whose contents
are not the same as in main memory. When the cache loads a cache line from main memory, its contents are marked
as clean. When the processor writes to a cache location, an entry is made in the list to indicate that the cache line is dirty. Later, when the “caching algorithm” decides, the dirty cache entries are written back to main memory.
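The bookkeeping described above can be sketched in software as a toy model, shown below. A real cache implements this in hardware and also tracks tags, sets and replacement policy, all of which are left out here.

    #include <stdio.h>
    #include <string.h>

    #define LINE_SIZE 16
    #define NUM_LINES  4

    /* A toy write-back model: each line carries a dirty flag, and dirty
     * lines are copied back to "main memory" only when flushed. */
    typedef struct {
        unsigned char data[LINE_SIZE];
        int           valid;
        int           dirty;
    } cache_line_t;

    static unsigned char main_memory[NUM_LINES][LINE_SIZE];
    static cache_line_t  cache[NUM_LINES];

    static void cache_load(int line)
    {
        memcpy(cache[line].data, main_memory[line], LINE_SIZE);
        cache[line].valid = 1;
        cache[line].dirty = 0;            /* freshly loaded lines are clean */
    }

    static void cache_write(int line, int offset, unsigned char value)
    {
        cache[line].data[offset] = value;
        cache[line].dirty = 1;            /* cache and memory now disagree */
    }

    static void cache_flush(void)
    {
        /* "Write back": only dirty lines are written out, ideally in bursts. */
        for (int i = 0; i < NUM_LINES; i++)
            if (cache[i].valid && cache[i].dirty) {
                memcpy(main_memory[i], cache[i].data, LINE_SIZE);
                cache[i].dirty = 0;
            }
    }

    int main(void)
    {
        cache_load(0);
        cache_write(0, 3, 0xAB);          /* main memory is now stale */
        cache_flush();                    /* coherency restored */
        printf("memory[0][3] = 0x%02X\n", main_memory[0][3]);
        return 0;
    }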
Both write through and write back schemes can be implemented either in hardware or in software. All modern commercial microprocessors I am aware of use hardware to enforce cache coherency.

Multi-core Cache Coherency
Utilizing cache in a multi-core design takes the problem of cache coherency to the next level. There are a number of multicore architectures on the market now that utilize shared memory. These architectures also utilize cache memory to speed execution.

Now, in addition to maintaining coherency between memory and a single cache,
the cache coherency scheme must address simultaneous reads and writes to an
individual processor’s cache while maintaining coherency with the shared memory
as well as with all the other cache memories that are used by the other processors.
The only efficient mechanism for doing this is by using cache coherency
hardware. Regardless of the hardware assist method used to maintain cache
coherency, using cache in multi-core systems tends to serialize process execution. The question is to what degree.
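One way to see that serializing effect from application code is the classic “false sharing” experiment sketched below; the thread count, padding size and iteration count are arbitrary choices made for illustration. Two threads update counters that happen to share a cache line, so the line bounces between the cores’ caches even though the threads never touch each other’s data; with the counters padded onto separate lines, the same work typically runs noticeably faster.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 100000000L

    /* Two counters in the same cache line: every increment forces the
     * coherency hardware to move the line between the cores. */
    static struct { volatile long a, b; } same_line;

    /* The same two counters padded onto separate lines (64 bytes assumed). */
    static struct { volatile long a; char pad[64]; volatile long b; } padded;

    static void *inc_same_a(void *arg)   { (void)arg;
        for (long i = 0; i < ITERATIONS; i++) same_line.a++; return NULL; }
    static void *inc_same_b(void *arg)   { (void)arg;
        for (long i = 0; i < ITERATIONS; i++) same_line.b++; return NULL; }
    static void *inc_padded_a(void *arg) { (void)arg;
        for (long i = 0; i < ITERATIONS; i++) padded.a++;    return NULL; }
    static void *inc_padded_b(void *arg) { (void)arg;
        for (long i = 0; i < ITERATIONS; i++) padded.b++;    return NULL; }

    static void run_pair(void *(*fa)(void *), void *(*fb)(void *),
                         const char *label)
    {
        pthread_t ta, tb;
        pthread_create(&ta, NULL, fa, NULL);
        pthread_create(&tb, NULL, fb, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        printf("%s finished\n", label);
    }

    int main(void)
    {
        /* Time the two runs (for example with "time ./a.out"). */
        run_pair(inc_same_a,   inc_same_b,   "counters on the same line");
        run_pair(inc_padded_a, inc_padded_b, "counters on separate lines");
        return 0;
    }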
In later articles, we will explore how different hardware architectures address this.

Todd Brian is product marketing manager at Accelerated Technology, a Mentor Graphics division. Part Two of this
series will cover various aspects of multiprocessing from an RTOS perspective,
looking at two different approaches - symmetric (SMP) and asymmetric (AMP)
multiprocessing - their benefits and limitations in real time, deterministic
environments. To learn more about this topic, go to “More about multicores, multiprocessing and tools.”